REMC isoform analysis - junctions

Gloria Li
Tue Jun 24 16:13:31 2014

Validate previously identified isoforms with junction RPKM

  • Previous isoform identification with DE exons
    • DE exons by DEfine FDR = 0.015
    • Exon RPKM \(\ge\) 10% of gene RPKM in one sample and \(\le\) 1% in the other
    • Gene RPKM of both samples > 0.005
    • Exclude DE genes by DEfine FDR = 0.015
  • Validation: For each isoform exon in the previous pairwise comparison
    • Find junctions associated with this exon with enough coverage, i.e. sum of junction coverage of two samples \(\ge\) 1
    • Identify junctions whose RPKM changes in the same direction as the exon
    • Junction RPKM > 0.1 in one sample and < 0.1 in the other
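The thresholds listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the analysis: the function names, the tuple layout, and the rule that a single supporting junction is enough are all hypothetical, and the DEfine tests (DE exons, exclusion of DE genes) are assumed to have been applied upstream.

```python
def isoform_direction(exon_rpkm, gene_rpkm, min_gene_rpkm=0.005,
                      high_frac=0.10, low_frac=0.01):
    """Return +1 if the exon is used only in sample 1, -1 if only in sample 2, else 0.

    exon_rpkm and gene_rpkm are (sample1, sample2) pairs. DEfine filtering is
    assumed to have happened before this call (hypothetical workflow).
    """
    (e1, e2), (g1, g2) = exon_rpkm, gene_rpkm
    # Gene RPKM must be > 0.005 in both samples.
    if g1 <= min_gene_rpkm or g2 <= min_gene_rpkm:
        return 0
    f1, f2 = e1 / g1, e2 / g2
    # Exon RPKM >= 10% of gene RPKM in one sample and <= 1% in the other.
    if f1 >= high_frac and f2 <= low_frac:
        return 1
    if f2 >= high_frac and f1 <= low_frac:
        return -1
    return 0


def validate_exon(direction, junctions, min_cov=1, rpkm_cut=0.1):
    """Classify an isoform exon from its junctions.

    junctions is a list of (cov1, cov2, rpkm1, rpkm2) tuples.
    Assumption: one junction changing in the exon's direction counts as support.
    """
    if direction not in (1, -1):
        raise ValueError("direction must be +1 or -1")
    # Enough coverage: summed junction coverage of the two samples >= 1.
    covered = [j for j in junctions if j[0] + j[1] >= min_cov]
    if not covered:
        return "no_coverage"
    for _, _, r1, r2 in covered:
        up, down = (r1, r2) if direction == 1 else (r2, r1)
        # Junction RPKM > 0.1 in the sample using the exon, < 0.1 in the other.
        if up > rpkm_cut and down < rpkm_cut:
            return "supported"
    return "unsupported"
```

For example, an exon at 25% of gene RPKM in sample 1 and 0.5% in sample 2 gets direction +1, and a junction with RPKM 0.8 versus 0.0 would then count as support.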

Results:

Compare strand specific and non-strand specific libraries

  • Non-strand specific libraries (RM080 and RM035) have much lower junction coverage than strand specific libraries (RM084).
  • Junction RPKM is much more comparable between strand specific and non-strand specific libraries than raw junction coverage.

plot of chunk strand-specifc (2 plots)

Junction RPKM clustering

  • Sample clustering on junction RPKM with Spearman correlation groups samples by cell type as well.
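As a rough illustration of clustering on Spearman correlation, the sketch below ranks each sample's junction RPKM vector, uses 1 - rho as the distance, and merges samples by average linkage. The linkage method, the function names, and the toy sample names are assumptions; the report does not state how the dendrogram was built.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / math.sqrt(sxx * syy)


def cluster_samples(profiles):
    """Average-linkage clustering on 1 - rho; returns clusters in merge order."""
    names = list(profiles)
    dist = {(a, b): 1 - spearman(profiles[a], profiles[b])
            for a in names for b in names if a != b}
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        _, i, j = min(
            (sum(dist[a, b] for a in c1 for b in c2) / (len(c1) * len(c2)), i, j)
            for i, c1 in enumerate(clusters)
            for j, c2 in enumerate(clusters) if i < j
        )
        merged = clusters[i] + clusters[j]
        merges.append(merged)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Toy junction RPKM vectors (made-up numbers, hypothetical sample names).
toy = {
    "lum_A": [5, 3, 0.1, 0.2, 1],
    "lum_B": [10, 7, 0.5, 0.3, 2],
    "myo_A": [0.1, 0.3, 6, 4, 1],
    "myo_B": [0.2, 0.1, 9, 8, 2],
}
merge_order = cluster_samples(toy)
```

Because Spearman correlation depends only on ranks, a library with systematically lower junction coverage can still group with its cell type as long as the ordering of junctions is preserved.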

plot of chunk cluster

Isoform junction validation

  • For strand specific libraries, about 40% of previously identified isoform exons (50% of genes) have junctions with enough coverage for validation. Of those, more than 99% have junction support.
  • For non-strand specific libraries, only about 20-35% of isoform genes have enough junction coverage for validation, and among them 75-80% have support from junction reads.
  • The Venn diagram of isoforms in RM080 is similar to that of RM084, with the common isoforms shared by different cell types having a lower validation rate.
  • Venn diagrams of isoforms between the same cell types in different individuals show similar patterns. The majority of isoforms are shared among individuals. However, isoforms in strand specific libraries (RM084) have a much higher validation rate.

Comparison      No.isoform.exons  No.isoform.genes  No.exons.with.junction.cov  No.genes.with.junction.cov  No.exons.with.junction.support  No.genes.with.junction.support
lum084_myo084   8630              2381              3618                        1228                        3604                            1217
lum084_stem084  8948              2429              4226                        1411                        4217                            1404
myo084_stem084  8619              2427              3346                        1247                        3333                            1238
lum080_myo080   12871             2325              1687                        509                         1235                            397
lum080_stem080  8688              2345              1559                        686                         1232                            559
myo080_stem080  8448              2390              1105                        548                         791                             427
lum035_myo035   12911             2341              3536                        817                         2324                            619
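
The percentages quoted in the bullets can be recomputed directly from the table above. The sketch below does that arithmetic; the dictionary layout and key names are illustrative choices, and the counts are copied from the table.

```python
# Counts copied from the table: (isoform exons, isoform genes,
# exons with junction coverage, genes with junction coverage,
# exons with junction support, genes with junction support).
counts = {
    "lum084_myo084": (8630, 2381, 3618, 1228, 3604, 1217),
    "lum084_stem084": (8948, 2429, 4226, 1411, 4217, 1404),
    "myo084_stem084": (8619, 2427, 3346, 1247, 3333, 1238),
    "lum080_myo080": (12871, 2325, 1687, 509, 1235, 397),
    "lum080_stem080": (8688, 2345, 1559, 686, 1232, 559),
    "myo080_stem080": (8448, 2390, 1105, 548, 791, 427),
    "lum035_myo035": (12911, 2341, 3536, 817, 2324, 619),
}


def validation_rates(row):
    """Fractions with enough coverage, and fractions supported among those."""
    exons, genes, exons_cov, genes_cov, exons_sup, genes_sup = row
    return {
        "exons_with_cov": exons_cov / exons,
        "genes_with_cov": genes_cov / genes,
        "exons_supported": exons_sup / exons_cov,
        "genes_supported": genes_sup / genes_cov,
    }


rates = {name: validation_rates(row) for name, row in counts.items()}

for name, r in rates.items():
    print(name, {key: round(100 * value, 1) for key, value in r.items()})
```

Note that the support rate is taken relative to exons or genes with enough junction coverage, not relative to all identified isoforms.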

plot of chunk validate (10 plots)

No. of exons for DE genes / isoform genes

  • DE genes have roughly the same No. of exons as all expressed genes.
  • Identified isoform genes have slightly more exons than DE genes and all expressed genes.

plot of chunk Nexon

Isoform genes in DE genes with small No. of exons

If a gene has only a few exons (\(\le\) 5 exons) and one exon is absent in one sample, i.e. the gene is an isoform, the absence can bias the overall gene RPKM, and the gene may be identified as a DE gene. Check whether there are such cases.
For DE genes with a small No. of exons:
  • If the gene is in fact an isoform and not a DE gene, only one or a few of its exons should be differentially expressed.
  • If it is truly a DE gene, all exons should be differentially expressed.

Proportion of DE exons for DE genes with \(\le\) 5 exons (DE exons: fold change \(\ge\) 2):
  • For the few genes where the proportion of DE exons is < 1, this is due to exons not expressed in either sample.
  • There is no evidence of isoform genes being identified as DE genes.
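
The check above can be sketched as a per-gene proportion of DE exons. This is a minimal illustration, assuming that an exon with zero RPKM in exactly one sample counts as DE and that an exon with zero RPKM in both samples cannot be called DE; the function name and these edge-case rules are assumptions, not the report's stated implementation.

```python
def de_exon_proportion(exon_rpkms, min_fold=2.0):
    """Fraction of a gene's exons with fold change >= min_fold between two samples.

    exon_rpkms is a list of (rpkm_sample1, rpkm_sample2) pairs, one per exon.
    """
    n_de = 0
    for r1, r2 in exon_rpkms:
        if r1 == 0 and r2 == 0:
            # Not expressed in either sample: cannot be called DE (assumption).
            continue
        low, high = sorted((r1, r2))
        # Zero in exactly one sample is treated as DE (assumption).
        if low == 0 or high / low >= min_fold:
            n_de += 1
    return n_de / len(exon_rpkms)


# A truly DE gene: every exon changes.
truly_de = de_exon_proportion([(10.0, 1.0), (8.0, 2.0), (6.0, 1.0)])
# An isoform-like gene: only one exon changes.
isoform_like = de_exon_proportion([(10.0, 9.0), (8.0, 8.0), (6.0, 0.0)])
# A DE gene with one unexpressed exon: proportion falls below 1.
de_with_silent_exon = de_exon_proportion([(10.0, 1.0), (0.0, 0.0), (6.0, 1.0)])
```

Under these rules an isoform misclassified as a DE gene would show a proportion well below 1 with all exons expressed, which is the pattern the analysis did not find.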

plot of chunk DE_isoform

Position of isoform exons on the gene

  • In general, alternatively spliced exons are more frequent at the two ends of genes.
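
One way to place exons on a common scale across genes is a relative position between 0 (5' end) and 1 (3' end). The sketch below uses the exon's rank along the transcribed strand; this definition, the function name, and the coordinates are assumptions for illustration, since the report does not say how position was computed.

```python
def relative_exon_position(exon_starts, exon_start, strand):
    """Relative rank of an exon along the gene: 0.0 = 5' end, 1.0 = 3' end.

    exon_starts holds the genomic start of every exon in the gene; on the
    minus strand the largest coordinate is the 5' end, so the order is reversed.
    """
    ordered = sorted(set(exon_starts), reverse=(strand == "-"))
    if len(ordered) < 2:
        raise ValueError("need at least two exons to define a relative position")
    return ordered.index(exon_start) / (len(ordered) - 1)


# Hypothetical three-exon gene.
first_plus = relative_exon_position([100, 500, 900], 100, "+")
first_minus = relative_exon_position([100, 500, 900], 900, "-")
middle = relative_exon_position([100, 500, 900], 500, "+")
```

A histogram of these values over all isoform exons would show enrichment near 0 and 1 if alternatively spliced exons concentrate at gene ends.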

plot of chunk exon_pos

  • Exon usage along the gene for all multi-transcript genes

Venn Diagram with average expression level, average No. of exons and average exon length

  • Common isoforms shared among different comparisons have a much lower validation rate.
  • Isoforms have a much lower expression level than all expressed genes.
  • On average, common isoforms between different comparisons have a lower expression level than comparison-specific isoforms.
  • In general, compared to all isoforms identified, validated isoforms do not have lower expression levels. Our validation approach is not biased towards highly expressed genes.
  • The average No. of exons is very similar across sections of the Venn diagram, and between all isoforms, validated isoforms, and all expressed genes.
  • Isoform exons are on average shorter than those of all expressed genes, and validated isoform exons are slightly shorter than all isoform exons in general, but the difference is not statistically significant (p = 0.39).
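
The per-section validation rates can be computed with plain set operations. The sketch below splits three comparisons into the seven Venn sections and reports the validated fraction in each; the function names and the toy gene IDs are hypothetical.

```python
def venn_sections(a, b, c):
    """Split three sets into the seven mutually exclusive Venn sections."""
    return {
        "a_only": a - b - c,
        "b_only": b - a - c,
        "c_only": c - a - b,
        "a_b": (a & b) - c,
        "a_c": (a & c) - b,
        "b_c": (b & c) - a,
        "a_b_c": a & b & c,
    }


def validated_fraction(section, validated):
    """Fraction of a section's isoforms that were validated (None if empty)."""
    if not section:
        return None
    return len(section & validated) / len(section)


# Toy isoform gene sets for three comparisons (hypothetical IDs).
lum_myo = {"g1", "g2", "g3", "g4"}
lum_stem = {"g3", "g4", "g5"}
myo_stem = {"g4", "g6"}
validated = {"g1", "g2", "g5"}

sections = venn_sections(lum_myo, lum_stem, myo_stem)
fractions = {name: validated_fraction(genes, validated)
             for name, genes in sections.items()}
```

The same sections can then be used to average expression level, No. of exons, or exon length per section.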

plot of chunk venn (8 plots)

Enrichment of all isoform genes and validated isoform genes for each section on the Venn diagram

  • Multifunctionality correction by ermineJ eliminates general terms and brings out more specific terms. Use ermineJ instead of DAVID for all GO analysis.

Summary

plot of chunk enrich

lum084 vs myo084

plot of chunk enrich_lum084_myo084 (2 plots)

lum084 vs stem084

plot of chunk enrich_lum084_stem084 (2 plots)

myo084 vs stem084

plot of chunk enrich_myo084_stem084 (2 plots)

Common isoforms shared by lum084 vs myo084, lum084 vs stem084, and myo084 vs stem084

plot of chunk erminej_lm_ls_ms (2 plots)

Examples for wet lab validation

RM084 lum vs myo

  • ENSG00000196208: GREB1, growth regulation by estrogen in breast cancer 1
  • ENSG00000008853: RHOBTB2, Rho-related BTB domain containing 2
  • ENSG00000108821: COL1A1, collagen, type I, alpha 1
  • ENSG00000110195: FOLR1, folate receptor 1 (adult)
  • ENSG00000138795: LEF1, lymphoid enhancer-binding factor 1
  • ENSG00000170312: CDK1, cyclin-dependent kinase 1

RM084 lum vs stem

  • ENSG00000064787: BCAS1, breast carcinoma amplified sequence 1
  • ENSG00000127084: FGD3, FYVE RhoGEF and PH domain containing 3
  • ENSG00000162894: FAIM3, Fas apoptotic inhibitory molecule 3
  • ENSG00000126217: MCF2L, MCF.2 cell line derived transforming sequence-like